1.      INTRODUCTION


Given  a  set   of  objects    each   described  by  a  vector  of   characteristics, 
a  clustering  technique   groups   those   objects  with  similar  characteristics 
together  into  subsets    called  clusters.      The   similarity   criterion  uses    an 
appropriate   distance  function  measuring  the   distance  between  objects  which 
varies  with  the   interpretation  of  the   characteristic  vector  for  the  set. 

Clustering  techniques  are  useful  in  many  areas.  For  example,  they  can 
be  used  in  medicine  to  identify  new  diseases  and  to  refine  existing  disease 
categories,  in  biology  to  develop  taxonomies  for  plants  and  animals,  and  in 
archaeology  to   classify   artifacts  with   respect  to  period  and  style. 

There   is   no  general  method  which   always  yields   useful   clusters   for  an 
arbitrary  set   of  objects.      Usually  different   techniques    are  tried,   and  often 
relevant   clusters   can  be   obtained  through   comparison   of  the   results.      Two   of 
the  most  popular  and  effective   techniques   are  the   single-link  method   [5] 
and  clique  generation. 

In both methods, the set of objects is interpreted as an undirected
graph. For a given distance function, we can define a threshold δ such that
if the distance between two objects is less than δ, then the two objects are
said to be similar. Using this concept, the set of objects can be interpreted
as a graph where nodes represent objects and edges join nodes representing
similar objects.
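
The construction can be sketched as follows. This is only an illustration: Euclidean distance, the helper name `similarity_graph`, and the sample points are assumptions that do not come from the text, and any other distance function could be substituted.

```python
from itertools import combinations
from math import dist


def similarity_graph(vectors, delta):
    """Return {node: set of similar nodes} for objects numbered from 1.

    Two objects are joined by an edge when their distance is less than
    the threshold delta. Euclidean distance is an assumption made for
    this sketch only.
    """
    graph = {i: set() for i in range(1, len(vectors) + 1)}
    for (i, u), (j, v) in combinations(enumerate(vectors, start=1), 2):
        if dist(u, v) < delta:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Hypothetical data: three nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
graph = similarity_graph(points, delta=1.5)
print(graph)
```

Lowering the threshold removes edges, and raising it adds them, so the same data set yields a family of graphs indexed by δ.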

The single-link method is the simplest and oldest technique. In this
method a cluster is defined to be a connected component of the graph. A
connected component is a maximal subgraph in which each pair of nodes is joined
by a path or sequence of edges. For objects which form distinct disconnected
clusters, the single-link method often yields good results (see Figure 1a),
and the simplicity of the method allows it to be applied efficiently to large
sets of objects. However, the obvious disadvantage of this method is the
chaining effect, where pairs of nodes may be joined by a path but the distance
between the objects they represent is large (see Figure 1b).
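
A minimal sketch of single-link clustering, assuming the graph is stored as a dictionary of neighbour sets (a representation chosen for this example, not taken from the text), is a depth-first search for connected components:

```python
def connected_components(graph):
    """Return the single-link clusters (connected components) of a graph."""
    seen = set()
    clusters = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            # Visit every neighbour not yet placed in this component.
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical data: a chain 1-2-3-4 and a separate pair 5-6.
chain = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: {6}, 6: {5}}
print(connected_components(chain))
```

Nodes 1 and 4 land in the same cluster although they are not adjacent, which is the chaining effect in miniature.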


Figure 1a.      Single-link clusters


Figure 1b.      Chaining effect


In clique generation a cluster is defined to be a maximal complete subgraph
or clique of the graph. A complete subgraph is one in which every node in the
subgraph is adjacent to every other node in the subgraph. A maximal complete
subgraph is a complete subgraph which is not properly contained in any other
complete subgraph. The problem of finding all the cliques in an arbitrary
graph is well known, and many algorithms have been proposed. The earliest
were developed by Bonner and Bierstone [1], but the most efficient algorithm
available was devised by Bron and Kerbosch [2]. Unfortunately, finding cliques
is much more difficult than finding connected components. In fact the problem
of finding all the cliques in an arbitrary graph is polynomial-complete, and
hence is equivalent in difficulty to the notoriously difficult
"traveling-salesman" problem.

These two methods lie at the extremes of any reasonable scale of tightness,
and so illustrate how loosely or how tightly clusters can be defined. Cliques
contain more information about the structure of a graph than connected
components, and although they are often too tight to be used as clusters, they
can form the nuclei of useful clusters. Hence we will restrict our attention
to clique generation.

Chapter  2  presents   a  detailed  study   of  the  Bron-Kerbosch  algorithm.      The 
algorithm  is    described,   analyzed,   and  shown  to  be  near  optimal.      Chapter  3 
discusses   the    efficient   implementation  of  the   algorithm,    describes    an  efficient 
new   implementation,   and  presents  numerical   results    demonstrating  the  superiority 
of  the  new  implementation   over  previous   ones. 
2.      THE  BRON-KERBOSCH  ALGORITHM

2.1     Analysis 

Mulligan [4] studied the algorithms of Bonner, Bierstone, and Bron and
Kerbosch in detail. His tests showed that the Bron-Kerbosch algorithm is
fastest. To generate n cliques, it required time of $O(n^{1+\epsilon})$, where
ε is a small positive quantity, while the other algorithms required time of
$O(n^{[\ldots]})$. Unfortunately the number of cliques in a graph generally is
an exponential function of the number of nodes in the graph. Hence the speed
of an algorithm is crucial.

Bron and Kerbosch's presentation of their algorithm is not well motivated.
Hence we will attempt to state clearly the motivation behind their algorithm
and to explain why it works so well.

Figure 2 illustrates how clique generation can be applied to a simple
data set. Figure 2a presents the adjacency matrix A for the six objects in
the set. Here $a_{ij}$ is 1 if and only if object i is similar to object j;
otherwise $a_{ij}$ is 0. Figure 2b presents a graph G representing the data set.
The nodes in G correspond to the objects in the set, and an edge connects any
two nodes in G corresponding to similar objects. Figure 2c presents a graph
T summarizing the possible ways of generating the cliques of G, starting with
the empty set, ∅. Each node in T is a complete subgraph, and each edge from
a node α in level ℓ to a node β in level ℓ + 1 is labeled with the node of G
added to α to form β. A clique is generated by traversing a path or sequence
of edges which terminates in a clique.

Figure 2a.      Adjacency Matrix A


Figure 2b.      Graph G


level 0    ∅
level 1    {1}  {2}  {3}  {4}  {5}  {6}
level 2    {1,2}  {1,3}  {2,3}  {2,4}  {3,5}  {3,6}  {5,6}
level 3    {1,2,3}  {3,5,6}

Figure 2c.      Clique Generation Graph T


Obviously one way to generate all the cliques is to visit every node and
traverse all the paths in T. This approach is time-consuming and wasteful
because most of the paths lead to cliques already generated. Most early
algorithms perform an ordered traversal of T, but this is still wasteful
because subsets of already generated cliques are repeatedly constructed.
Bron and Kerbosch used a cleverer approach and were able to eliminate
certain paths by applying the ideas formalized in the following lemmas.

Lemma 1.   Suppose the paths from node α in T beginning with the edge
labeled with node a of G have been explored so that all cliques containing
α ∪ {a} have been generated. Then only those paths from α beginning with
edges labeled with nodes of G not adjacent to a need be explored.
Proof.   Let C be any clique generated by exploring a path from α beginning
with an edge labeled with a node adjacent to a. Then it must either contain
or not contain nodes not adjacent to a. Suppose it contains such a node,
say b. Then clearly it can be generated by exploring a path from α beginning
with the edge labeled with node b which is not adjacent to a. Suppose
it contains no such nodes. It obviously contains α, and it must contain a
since all its other nodes by assumption are adjacent to a. Therefore it has
already been generated.   Q.E.D.

Lemma 2.   Suppose the paths from node α in T beginning with the edge labeled
with node a of G have been explored so that all cliques containing α ∪ {a}
have been generated. Then at any node β of T which properly contains α,
those paths beginning with an edge labeled with a can be ignored.
Proof.   Suppose a path from β beginning with an edge labeled with a is
explored. It must clearly terminate in a clique containing α ∪ {a}. But
by assumption, all the cliques containing α ∪ {a} have already been generated,
and hence the path can be ignored.   Q.E.D.

The Bron-Kerbosch algorithm is recursive; upon arriving at a node in
level ℓ, the algorithm calls itself to explore the levels higher than ℓ.
Lemma 1 is applied at each level. Upon first arriving at node α in T at
level ℓ, the algorithm selects a node of G called FIXP that is adjacent to
the most nodes adjacent to the partially constructed clique α, moves to the
node α ∪ {FIXP}, and calls itself to construct all the cliques containing
α ∪ {FIXP}. This choice of FIXP eliminates the maximum number of paths from
α. Upon returning to node α, the algorithm chooses a node of G called SEL
that is adjacent to the nodes in α but not adjacent to FIXP, moves to the
node α ∪ {SEL}, calls itself to construct all the cliques containing α ∪ {SEL},
and repeats this procedure for all such nodes. This process is illustrated
for the graph G of Figure 2b in Figure 3.

Lemma  1   cannot  be  used  to   eliminate   all  redundant   edges.      Note,    for 
example,   that  the   edge   labeled  6   from  node   {3}   in   Figure    3   is   traversed 
although  it   leads   to  the   clique   {3,5,6},   previously  generated.      However, 
lemma  2    can  be  used  to   eliminate  the   edge   labeled   5   from  node   {3,6},   and  the 
clique   is  not   regenerated. 

[ 1]  Select 3 as FIXP at level 0, and move to {3}.
[ 2]  Select 1 as FIXP at level 1, and move to {1,3}.
[ 3]  Select 2 as FIXP at level 2, move to clique {1,2,3}, and back up to {3}.
[ 4]  Ignore 2 because 2 is adjacent to 1, FIXP at level 1.
[ 5]  Select 5 as SEL at level 1, and move to {3,5}.
[ 6]  Select 6 as FIXP at level 2, move to clique {3,5,6}, and back up to {3}.
[ 7]  Select 6 as SEL at level 1, and move to {3,6}.
[ 8]  Ignore 5 because 5 was selected at {3}, and back up to ∅.
[ 9]  Ignore 1, 2, 5, and 6 because 1, 2, 5, and 6 are adjacent to 3, FIXP at
      level 0.
[10]  Select 4 as SEL at level 0, and move to {4}.
[11]  Select 2 as FIXP at level 1, move to clique {2,4}, back up to ∅, and stop.

Figure 3.      Application of the Bron-Kerbosch Algorithm


Lemma 1 is used only once at each level for the node FIXP. Conceivably
it could be applied repeatedly for every node SEL at each level. However,
if it is used to eliminate edges labeled with nodes adjacent to SEL, some
previously ignored edges may have to be traversed. For example, if lemma 1
is applied at node {3} in Figure 3 for node SEL, when SEL is 5, the edge
labeled 6 can be ignored, but only if the previously ignored edge labeled 2
is traversed. Hence, the lemma should be used selectively so that the number
of new edges to be traversed is less than the number of edges to be eliminated.

The elimination of all redundant edges requires additional tests. Thus
if lemma 1 is applied at node {3} for node 5, the edge labeled 2 must be
traversed to generate the clique {2,3,6}, if this clique exists. However, if
a test reveals that nodes 2 and 6 are not adjacent, this clique cannot exist,
and the edges labeled 2 and 6 can both be ignored.

These  modifications  were  incorporated  into  the   Bron-Kerbosch  algorithm, 
but  the  performance   of  the   algorithm  was  not   improved  because  the  time 
required  to   perform  the   additional  tests  was    comparable  to  that   required  to 
traverse  the  redundant  edges.      Therefore   it   seems   unlikely  that  the  Bron- 
Kerbosch  algorithm  can  be   improved  significantly. 
2.2     Bron-Kerbosch  Algorithm

The following formulation of the Bron-Kerbosch algorithm is very similar
to Mulligan's formulation.

ALGORITHM_BRON-KERBOSCH:  PROCEDURE;

  DECLARE  S  the set of all data nodes,
           NIL  the empty set,
           C  a global integer variable,
           COMPSUB  a global set of nodes;
           /* COMPSUB is a complete subgraph containing C nodes */

  STEP_1:  /* Initially COMPSUB is empty, none of the nodes have been explored,
              and all nodes are candidates which can be added to COMPSUB.
              Hence call EXTEND with arguments NIL and S. */
           C = 0;
           COMPSUB = NIL;
           CALL EXTEND(NIL,S);

  EXTEND:  PROCEDURE(EXPL,CAND) RECURSIVE;

    DECLARE  EXPL  a local set of nodes,
             /* EXPL = {a ∈ S | a adjacent to all the nodes ∈ COMPSUB, and
                all cliques containing COMPSUB ∪ {a} have been generated}.
                Nodes in EXPL are not added to COMPSUB because this would
                lead to cliques previously generated. */
             CAND  a local set of nodes,
             /* CAND = {a ∈ S | a adjacent to all nodes ∈ COMPSUB,
                a ∉ EXPL}, the set of nodes called candidates that can be
                added to COMPSUB to form new complete subgraphs. */
             NEXPL  a local set of nodes,
             /* NEXPL = {a ∈ EXPL | a adjacent to SEL}, the new set of
                explored nodes constructed in STEP_3 of EXTEND for the
                next recursive call to EXTEND. */
             NCAND  a local set of nodes,
             /* NCAND = {a ∈ CAND | a adjacent to SEL, a ≠ SEL}, the
                new set of candidates constructed in STEP_3 of EXTEND
                for the next recursive call to EXTEND. */
             FIXP  a local variable representing one node,
             /* FIXP is the first node ∈ EXPL ∪ CAND adjacent to the
                most nodes ∈ CAND. */
             SEL  a local variable representing one node;
             /* SEL is a node ∈ CAND selected to be added to COMPSUB. */

    STEP_2:  /* Choose FIXP and SEL. */
             FIXP = first node ∈ EXPL ∪ CAND adjacent to the most nodes ∈ CAND;
             IF FIXP ∈ EXPL
                THEN SEL = first node ∈ CAND not adjacent to FIXP;
                ELSE SEL = FIXP;

    STEP_3:  /* Add SEL to COMPSUB, increment C, and construct NEXPL and NCAND.
                Note that the number of candidates decreases for each call to
                EXTEND; hence EXTEND always returns. */
             NEXPL = {a ∈ EXPL | a adjacent to SEL};
             NCAND = {a ∈ CAND | a adjacent to SEL, a ≠ SEL};
             COMPSUB = COMPSUB ∪ {SEL};
             CAND = CAND - {SEL};
             EXPL = EXPL ∪ {SEL};
             C = C + 1;

    STEP_4:  /* If NCAND and NEXPL are empty, a clique has been generated. */
             IF (NEXPL = NIL) & (NCAND = NIL) THEN print the clique of C nodes
                contained in COMPSUB;

    STEP_5:  /* If NCAND is not empty, COMPSUB can be extended further. */
             IF NCAND ¬= NIL THEN CALL EXTEND(NEXPL,NCAND);

    STEP_6:  /* Either NEXPL is not empty and NCAND is empty which implies
                that a previously generated clique is being constructed, or
                a new clique has been printed, or a successful return from
                EXTEND has occurred. Hence back up by removing SEL from
                COMPSUB. If possible, select a new SEL and attempt to
                generate more cliques. Otherwise return. */
             C = C - 1;
             COMPSUB = COMPSUB - {SEL};
             IF there are nodes ∈ CAND not adjacent to FIXP
                THEN select the first such node as SEL and go to STEP_3;
                ELSE RETURN;

  END EXTEND;
END ALGORITHM_BRON-KERBOSCH;
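
The formulation above can be rendered as a short runnable sketch. It is an illustration under stated assumptions rather than a transcription of any existing implementation: the graph is a dictionary of neighbour sets with no self-loops, "first node" is taken to mean lowest node number, and the function name `bron_kerbosch` is chosen for this example. The edge list is the one recoverable from level 2 of Figure 2c.

```python
def bron_kerbosch(graph):
    """Return all cliques (maximal complete subgraphs) of an undirected graph.

    graph maps each node to the set of its neighbours (no self-loops).
    """
    cliques = []
    compsub = []  # the partially constructed clique

    def extend(expl, cand):
        # STEP_2: FIXP is the first node of EXPL U CAND adjacent to the
        # most candidates (ties broken by node number: an assumption).
        fixp = max(sorted(expl | cand), key=lambda a: len(graph[a] & cand))
        # SEL runs over FIXP itself (if it is a candidate) and then over
        # the candidates not adjacent to FIXP, as Lemma 1 allows.
        selections = [fixp] if fixp in cand else []
        selections += sorted(cand - graph[fixp] - {fixp})
        for sel in selections:
            # STEP_3: build the sets for the next level.
            nexpl = expl & graph[sel]
            ncand = cand & graph[sel]
            compsub.append(sel)
            if not nexpl and not ncand:
                cliques.append(sorted(compsub))   # STEP_4: a clique
            elif ncand:
                extend(nexpl, ncand)              # STEP_5: go one level up
            # STEP_6: back up; SEL moves from CAND to EXPL (Lemma 2).
            compsub.pop()
            cand = cand - {sel}
            expl = expl | {sel}

    if graph:
        extend(set(), set(graph))
    return cliques


# Graph G with the edges listed at level 2 of Figure 2c.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (3, 6), (5, 6)]
g = {n: set() for n in range(1, 7)}
for u, v in edges:
    g[u].add(v)
    g[v].add(u)

cliques = bron_kerbosch(g)
print(cliques)
```

On this graph the sketch produces the cliques {1,2,3}, {3,5,6}, and {2,4} in that order, the same order as the trace in Figure 3.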

2.3     Moon-Moser  Graphs

Bron and Kerbosch tested their algorithm on the Moon-Moser graphs [3]
which contain more cliques per node than any other graphs. These graphs
have 3k nodes grouped into k triplets, and each node is adjacent to every
other node except the two nodes in the same triplet. The graph with 3k
nodes contains $3^k$ cliques. They found that their algorithm constructed
all the cliques in the Moon-Moser graph with 3k nodes in time of $O(3.14^k)$.
We can gain some insight into this result by counting the number of
comparisons required to generate the cliques.

For a Moon-Moser graph with 3k nodes grouped into the triplets {1,2,3},
{4,5,6}, ..., {3k-2, 3k-1, 3k}, the number of comparisons, $c_k$, is as follows.

Operation                        Comparisons

Find FIXP                        3k(3k-1)
Construct lists EXPL,CAND        3k-1
Call EXTEND                      $c_{k-1}$
Find next SEL                    1
Construct lists NEXPL,NCAND      3k-1
Call EXTEND                      $c_{k-1}$
Find next SEL                    1
Construct lists NEXPL,NCAND      3k-1
Call EXTEND                      $c_{k-1}$

This is the best case because the choices for SEL are 2 and 3, and finding the
next SEL requires only one comparison. Summing these counts yields

$$c_k = 3c_{k-1} + 9k^2 + 6k - 1 .$$

This linear difference equation is easily solved giving

$$c_k = 3^k \sum_{i=1}^{k} (9i^2 + 6i - 1)\,3^{-i} .$$

The worst case occurs when the choices for SEL are 3k-1 and 3k. In this case

$$c_k = 3^k \sum_{i=1}^{k} (9i^2 + 12i - 7)\,3^{-i} .$$

As $k \to \infty$, both sums converge yielding

$$c_k^{\text{best}} \sim 3^{k+2.6053}, \qquad c_k^{\text{worst}} \sim 3^{k+2.6801} .$$

If the number of comparisons is an adequate measure of the time required by
the algorithm, then the Bron-Kerbosch algorithm operates at the theoretical
limit of $O(3^k)$.
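
The recurrence, its closed form, and the two limiting exponents can be checked numerically. The sketch below assumes the initial condition $c_0 = 0$, which the text does not state explicitly, and the function names are chosen for this example only.

```python
from math import log


def best_case(k):
    """c_k from the recurrence c_k = 3 c_{k-1} + 9k^2 + 6k - 1, assuming c_0 = 0."""
    c = 0
    for i in range(1, k + 1):
        c = 3 * c + 9 * i * i + 6 * i - 1
    return c


def closed_form(k, linear, constant):
    """3^k * sum_{i=1..k} (9 i^2 + linear*i + constant) * 3^(-i), in exact integers."""
    return sum((9 * i * i + linear * i + constant) * 3 ** (k - i)
               for i in range(1, k + 1))


def limit_exponent(linear, constant, terms=200):
    """log base 3 of the limiting sum, so that c_k ~ 3^(k + exponent)."""
    total = sum((9 * i * i + linear * i + constant) / 3.0 ** i
                for i in range(1, terms + 1))
    return log(total, 3)


print(best_case(5), closed_form(5, 6, -1))   # recurrence vs. closed form
print(round(limit_exponent(6, -1), 4))       # best-case exponent
print(round(limit_exponent(12, -7), 4))      # worst-case exponent
```

The limiting sums are 17.5 and 19, whose base-3 logarithms are the constants 2.6053 and 2.6801 quoted above.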
3.   IMPLEMENTATION

In  most   areas  where   clustering  techniques   are   applied,   large   amounts   of 
data  are  generally  processed.      Clique  generation   is   often  used  as   an  important 
first  step  in   classifying  the    data.      To  make   clique   generation  practical   for 
large   data  sets,   it   is   essential  to   develop  an  efficient   implementation  of  the 
fastest   available   algorithm,   the  Bron-Kerbosch   algorithm. 

Bron  and  Kerbosch  implemented  their  algorithm  in  Algol,  while  Mulligan 
implemented  it  in  PL/I.  Since  their  implementations  are  identical,  we  will 
restrict  our  attention   to  Mulligan's. 

Mulligan's implementation is fairly fast. However, since enormous
amounts of time are generally required to process large graphs, even a modest
improvement in performance is of practical significance. In his implementation,
EXPL and CAND are concatenated into a single vector of integers with a pointer
indicating the boundary between the two lists. A selected candidate is
transferred from the candidate list to the explored list by exchanging it with
the first node in the candidate list and incrementing the pointer by one. This
data structure has certain advantages. Additions to and deletions from the
lists are simple, and the determination of the list contents is trivial.
However, it complicates the execution of the principal operations of finding
FIXP and constructing the lists NEXPL and NCAND. These operations must be
performed serially node by node in loops. They could be speeded up if the
lists were sorted, but this would require a more elaborate list structure, and
additions to and deletions from the lists would be more complicated.

We observed that the principal operations can be written in terms of set
intersections as follows:

FIXP  = first node i ∈ EXPL ∪ CAND for which CAND ∩ {nodes ∈ S adjacent
        to i} has maximum number of nodes,
NEXPL = EXPL ∩ {nodes ∈ S adjacent to SEL},
NCAND = CAND ∩ {nodes ∈ S adjacent to SEL}.

If the lists are represented by bit strings, then intersections of the lists
can be computed rapidly using boolean operations which perform blocks of
comparisons in parallel. Hence we chose to represent the lists of candidates
and explored nodes and the rows of the adjacency matrix for an m-node graph
by m-bit strings. This new data structure speeds up the principal operations,
but it also creates new problems. The determination of the names and the
number of nodes in a list, formerly trivial, now is fairly difficult. It
would defeat the purpose of the new data structure to perform these operations
bit by bit in a high-level language. Hence we chose to implement these
operations in a low-level language. An efficient subroutine can be written
to count the one bits in a bit string using the IBM/360 logical instruction
"translate and test", which maps a byte into a table. Unfortunately, there
is no IBM/360 instruction which extracts the locations of the one bits in a
string. However, a subroutine which rapidly extracts the one bit locations
can be written using a fast register to register add instruction.
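
The bit-string formulation can be illustrated with Python integers standing in for m-bit strings. This is a sketch only: the mapping of node i to bit i-1, the helper names `mask`, `count_bits`, and `extract_bits`, and the sample graph (the edges recoverable from Figure 2c) are assumptions made for the example, and the helpers merely play the roles of the assembler subroutines described above.

```python
def mask(nodes):
    """Pack a collection of 1-based node numbers into an integer bit string."""
    bits = 0
    for node in nodes:
        bits |= 1 << (node - 1)
    return bits


def count_bits(bits):
    """Number of one bits in the string (the role of a counting subroutine)."""
    return bin(bits).count("1")


def extract_bits(bits):
    """Node numbers of the one bits, lowest first (the role of an extraction subroutine)."""
    positions = []
    node = 1
    while bits:
        if bits & 1:
            positions.append(node)
        bits >>= 1
        node += 1
    return positions


# Rows of the adjacency matrix as bit strings.
adjacent = {
    1: mask([2, 3]), 2: mask([1, 3, 4]), 3: mask([1, 2, 5, 6]),
    4: mask([2]), 5: mask([3, 6]), 6: mask([3, 5]),
}

expl = 0                      # no explored nodes yet
cand = mask(range(1, 7))      # every node is a candidate

# FIXP: first node of EXPL U CAND whose row intersects CAND in the most bits.
fixp = max(extract_bits(expl | cand), key=lambda i: count_bits(cand & adjacent[i]))

# Each new list is a single boolean AND over the whole string.
sel = fixp
nexpl = expl & adjacent[sel]
ncand = cand & adjacent[sel]
print(fixp, extract_bits(nexpl), extract_bits(ncand))
```

Each intersection is one AND over the whole string, which is where the speed-up comes from; counting and extracting bits are the operations that then need special care.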

We implemented the basic algorithm in PL/I. A listing of the PL/I
procedure EXTEND and the Assembler subroutines is presented in the Appendix.

3.1     Numerical  Results 


The new implementation was compared to Mulligan's on several graphs.
Unfortunately, accurate timing results could not be obtained because of the
local multiprogramming environment.

Some typical results are shown in Tables 1 and 2. Table 1 presents
results for the Moon-Moser graphs with 3k nodes. Since the time estimates
are contaminated with random errors, least squares approximations of the
form ak + b were fitted to the logarithms of the times. These approximations
indicate that the time required to generate $3^k$ cliques is proportional to
$3.00^k$ for Mulligan's implementation and to $2.99^k$ for the new
implementation. Given the timing errors, we can conclude that the actual
execution time is probably proportional to $3^k$.

Table 2 presents results obtained using data from a color-shape
preference test for preschool children. Each child's performance is
described by a characteristic vector of 72 bits. Two performances were
judged to be similar if δ or more bits in the corresponding vectors matched.
We analyzed an 80 node graph summarizing the data for 80 children. As the
threshold δ decreases, the number of edges in the graph grows, and the number
of cliques increases rapidly.
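
The similarity test for this data set can be sketched as a count of matching bit positions. The sample vectors below are hypothetical, and the helper names are chosen for this example; only the width of 72 bits and the rule "δ or more matching bits" come from the text.

```python
WIDTH = 72  # bits per characteristic vector, as stated in the text


def matching_bits(u, v, width=WIDTH):
    """Number of bit positions in which two width-bit vectors agree."""
    differing = bin((u ^ v) & ((1 << width) - 1)).count("1")
    return width - differing


def similar(u, v, delta):
    """Two performances are similar if delta or more bits match."""
    return matching_bits(u, v) >= delta


a = (1 << WIDTH) - 1          # hypothetical vector: all 72 bits set
b = a ^ ((1 << 40) - 1)       # differs from a in the 40 low-order bits
print(matching_bits(a, b), similar(a, b, 33), similar(a, b, 31))
```

Here the two vectors agree in 32 positions, so they are joined by an edge at threshold 31 but not at threshold 33, which is how lowering δ adds edges to the graph.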

In  both  examples,   the  new  implementation   is  superior  to  Mulligan's. 
Although   the   improvement  in   performance   is   relatively  modest,   it   is 
significant   in  view  of  the  high   cost   of   analyzing  large  data  sets. 
             Time (Seconds)
  k       Mulligan      Present

  5         1.43          .84
  6         5.02         2.48
  7        14.13         7.43
  8        42.44        22.94
  9       129.82        66.34

Table 1.      Moon-Moser Graphs with 3k Nodes


              Number          Time (Seconds)
Threshold   of Cliques     Mulligan      Present

   33           165           8.78         1.42
   31           315          14.39         2.87
   29           730          31.25         6.10
   27          2348          95.10        20.57
   25          7505         346.78        66.78

Table 2.      Color-Shape Preference Test Graph
APPENDIX


PL/I  and  ASSEMBLER  PROGRAMS
EXTEND:
   PROCEDURE(CAND,EXPL,N) RECURSIVE;

   /*   RECURSIVE PROCEDURE GENERATING ALL POSSIBLE CLIQUES EXTENDED
        FROM THE PARTIAL SOLUTION IN "COMPSUB" USING "CAND"
        INITIALLY, CAND CONTAINS 1 BITS FOR ALL NODES PRESENT
                   EXPL IS THE NIL OR ZERO BIT STRING
        GLOBALLY DEFINED VARIABLES ARE:
           PRNTFLG   - BIT 1 => CLIQUES ARE TO BE PRINTED
                       BIT 0 => CLIQUES ARE NOT PRINTED JUST
                                COUNTED IN NUMOUT
           NUMOUT    - COUNTER OF CLIQUES WHEN PRNTFLG IS BIT 0
           CONNECTED - N DIMENSIONAL VECTOR, EACH ELEMENT IS
                       OF N BITS. ( ADJACENCY MATRIX )
                       CONNECTED(J) - BIT STRING REPRESENTING
                                      NODES ADJACENT TO J
           VERTEX    - N DIMENSIONAL VECTOR LIKE CONNECTED
                       VERTEX(J) IS OF N BITS WITH A 1 BIT IN THE
                       JTH POSITION ONLY */

   /*   ASSEMBLER SUBROUTINES */

   /*   COUNT(NB,STR,CT) IS A SUBROUTINE THAT COUNTS UP THE
        NUMBER OF 1 BITS IN A BIT STRING:
        CT = # OF 1 BITS IN BIT STRING STR, WHERE STR IS OF LENGTH
             NB BYTES */

   DCL COUNT OPTIONS(ASM)
             ENTRY(FIXED BIN(15,0),BIT(*),FIXED BIN(15,0)),
   /*   XTRACT(NB,STR,LIST,M) IS A SUBROUTINE THAT EXTRACTS THE
        POSITIONS OF 1 BITS IN A BIT STRING
        LIST = LIST OF POSITIONS OF 1 BITS IN THE BIT STRING STR,
               WHERE STR HAS LENGTH NB BYTES, AND
        M = NUMBER OF ELEMENTS IN LIST */
       XTRACT OPTIONS(ASM)
             ENTRY(FIXED BIN(15,0),BIT(*),(*) FIXED BIN(15,0),
                   FIXED BIN(15,0)),
       CAND BIT(*),                /*   CANDIDATES PASSED */
       EXPL BIT(*),                /*   EXPLORED NODES PASSED */
       (NCAN,NEXP) BIT(N),         /*   NEW CAND, NEW EXPL */
       NB FIXED BIN(15,0),         /*   NO. OF BYTES IN BIT STRINGS */
       ASEL BIT(N),                /*   LIST OF FUTURE SELECT NODES */
       FIXP FIXED BIN(15,0),       /*   MOST CONNECTED NODE W.R.T. THE
                                        CANDIDATES */
       RLIST(N) FIXED BIN(15,0),   /*   RETURN LIST FROM ASM ROUTINES */
       ZERO BIT(N),                /*   ZERO BIT STRING */
   /*   CT, CNT ARE COUNTERS OF 1 BITS
        SEL IS THE SELECTED NODE
        IFL IS A FLAG
        OTHERS ARE INDEXING VARIABLES */
       (I,CNT,CT,IFL,IS,SEL,J,M) FIXED BIN(15,0);

   ASEL = ASEL & ¬ASEL;            /*   INITIALIZE */
   ZERO=ASEL;
   IFL=0;
   NB=N/8;                         /*   ASSUME N DIVISIBLE BY 8 */
   CNT=0;
   /*   GOING THRU CAND AND EXPL LISTS TO FIND THE NODE THAT IS
        CONNECTED TO THE MOST CANDIDATES */
   IF ZERO=EXPL THEN GO TO SKP1;
                                   /*   INTEGER REPRESENTATION OF ALL
                                        EXPL NODES INTO RLIST */
   CALL XTRACT(NB,EXPL,RLIST,M);
   DO I=1 TO M;                    /*   SEARCH THRU EXPL LIST */
      J=RLIST(I);
                                   /*   COUNT CONNECTIONS TO CANDS. */
      CALL COUNT(NB,CONNECTED(J) & CAND,CT);
      IF CT>CNT THEN               /*   FIND MOST CONNECTED NODE */
      DO;
         CNT=CT;  FIXP=J;  ASEL = ¬CONNECTED(J) & CAND;
      END;
   END;
SKP1: IF ZERO=CAND THEN GO TO SKP2;
   CALL XTRACT(NB,CAND,RLIST,M);   /*   EXTRACT NODES FROM CAND LIST */
   DO I=1 TO M;                    /*   SEARCH THRU CAND LIST */
      J=RLIST(I);
      CALL COUNT(NB,CONNECTED(J) & CAND,CT);
      IF CT>CNT THEN
      DO;
         CNT=CT;  FIXP=J;  IFL=1;  ASEL = ¬CONNECTED(J) & CAND;
      END;
   END;
SKP2: CALL XTRACT(NB,ASEL,RLIST,M);  IS=1;
   IF IFL=0 THEN                   /*   IFL=0 => FIXP IS IN EXPL LIST */
   DO;                             /*   SELECTED NODE MUST BE A CAND.
                                        INSTEAD OF FIXP, CHOOSE A CAND.
                                        NOT CONNECTED TO FIXP */
      SEL=RLIST(IS);  IS=IS+1;
   END;
   ELSE SEL=FIXP;                  /*   ELSE FIXP IS A CAND., CHOOSE
                                        FIXP */
   DO WHILE (IS <= M+1);           /*   WHILE THERE ARE STILL SEL'S */
                                   /*   CONSTRUCT NEW CAND, EXPL */
      NEXP = EXPL & CONNECTED(SEL);
      NCAN = CAND & CONNECTED(SEL) & ¬VERTEX(SEL);
      C=C+1;
      COMPSUB(C)=SEL;              /*   ADD SELECTED NODE TO PARTIAL
                                        SOLUTION OF CLIQUE */
                                   /*   NO MORE CANDIDATES AND EXPLORED
                                        NODES => A CLIQUE IS FOUND */
      IF (NCAN | NEXP) = ZERO THEN
         IF PRNTFLG THEN
            PUT EDIT((COMPSUB(I) DO I = 1 TO C)) (SKIP,(C)F(3));
         ELSE NUMOUT=NUMOUT+1;
                                   /*   IF THERE ARE MORE CANDIDATES TO
                                        BE EXPLORED, CALL EXTEND TO GO
                                        DOWN ONE MORE LEVEL */
      ELSE IF NCAN ¬= ZERO THEN CALL EXTEND(NCAN,NEXP,N);
      C=C-1;                       /*   DELETE EXPLORED NODE FROM
                                        CLIQUE */
                                   /*   DELETE EXPLD. NODE FROM CAND */
      CAND = CAND & ¬VERTEX(SEL);
                                   /*   ADD EXPLORED NODE TO EXPL */
      EXPL = EXPL | VERTEX(SEL);
      SEL=RLIST(IS);               /*   SELECT ANOTHER NODE OUT OF LIST
                                        OF CANDIDATES NOT CONNECTED TO
                                        FIXP */
      IS=IS+1;
   END;
END EXTEND;
COUNT    BEGIN
         L     3,0(0,1)          R3=ADDR(LENGTH OF STRING IN BYTES)
         L     4,4(0,1)          R4=ADDR(STRING)
         L     5,8(0,1)          R5=ADDR(COUNT)
         LA    9,1               R9=CONSTANT 1
         LA    8,0               R8=COUNT REGISTER
         LH    6,0(0,3)          OBTAIN LENGTH OF STRING
         AR    6,4               R6=ADDR OF END OF STRING
         SR    4,9               SCAN START ONE BYTE TO THE LEFT
         LA    2,0               CLEAR R2,R1
         LA    1,0
LOOP     TRT   1(256,4),TABLE    TRANSLATE AND TEST
         CR    6,1               HAS SCAN REACHED END?
         BC    12,OUT
         AR    8,2               NO, ADD TO COUNT REGISTER
         LR    4,1               RESET SCAN ADDR
         B     LOOP              BRANCH BACK
OUT      STH   8,0(0,5)          YES. STORE INTO COUNT
         LEAVE
TABLE    DC    X'000101020102020301020203020303040102020302030304'
         DC    X'020303040304040501020203020303040203030403040405'
         DC    X'020303040304040503040405040505060102020302030304'
         DC    X'020303040304040502030304030404050304040504050506'
         DC    X'020303040304040503040405040505060304040504050506'
         DC    X'040505060506060701020203020303040203030403040405'
         DC    X'020303040304040503040405040505060203030403040405'
         DC    X'030404050405050603040405040505060405050605060607'
         DC    X'020303040304040503040405040505060304040504050506'
         DC    X'040505060506060703040405040505060405050605060607'
         DC    X'04050506050606070506060706070708'
         END
XTRACT   BEGIN
         L     2,0(0,1)          R2=ADDR OF N, # OF BYTES
         L     3,4(0,1)          R3=ADDR OF STRING
         L     4,8(0,1)          R4=ADDR OF PASSED LIST
         LA    9,8               R9=CONSTANT 8
         LA    6,1               R6=CONSTANT 1
         LA    5,0               R5=BYTE COUNT
         LA    8,0               R8=BIT COUNT
         LA    10,2              R10=CONSTANT 2
LOP      IC    7,0(0,3)          PUT ONE BYTE IN R7
         SLL   7,24              SHIFT LEFT
LOP1     AR    8,6               UP BIT COUNTER
         ALR   7,7               SHIFT LEFT ONE
         BC    12,NOAD           CARRY?  NO, GO TO NOAD
         STH   8,0(0,4)          YES. STORE BIT INDEX
         BC    2,NXT             ZERO?  YES, NEXT BYTE
         AR    4,10              READY FOR NEXT STORE
         B     LOP1              BACK FOR ANOTHER SHIFT
NXT      AR    4,10              READY FOR NEXT STORE
         B     NEXT
NOAD     BC    4,LOP1            NO CARRY. NOT ZERO. SHIFT AGAIN
NEXT     AR    5,6               UP COUNT OF BYTES
         LR    8,5               RESET BIT COUNTER
         SLA   8,3
         AR    3,6               NEXT BYTE ON STRING
         CH    5,0(0,2)          # OF BYTES EXCEED LENGTH?
         BC    4,LOP             NO. GET ANOTHER BYTE
         L     7,12(0,1)         R7 = ADDR OF FOURTH PARAMETER
         S     4,8(0,1)          COUNT # OF INDEX STORED
         SRA   4,1
         STH   4,0(0,7)          STORE AND RETURN
         LEAVE